Make Timeout repeatedly interrupt a stuck thread, and note that this is happening #48
Conversation
Object hashes were the same across 9 thread dumps taken at 5-second intervals, so most likely this is the first case.
abayer left a comment
Seems reasonable to me, but I'd like @svanoort's thoughts.
Which object hashes? Anyway, @reviewbybees done for now.
By the way, before implementing this (so using the original message) I … and waited. The timeout was enforced to the extent that individual calls to … with no further human intervention. Could probably develop this into an automated test in ….
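An automated test along those lines would mainly need a body that stays stuck even when interrupted once. Here is a minimal, purely hypothetical Java sketch of such a body (none of these names come from the plugin):

```java
/**
 * Hypothetical stand-in for a stuck step body: it swallows every interrupt,
 * so a single Thread.interrupt() is not enough to unstick it. A test could
 * run something like this under Timeout and assert that it eventually dies.
 */
public class StuckBody {
    public static void main(String[] args) throws Exception {
        Thread stuck = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(Long.MAX_VALUE); // "work" that never finishes
                } catch (InterruptedException x) {
                    // badly behaved code: swallow the interrupt and carry on
                }
            }
        }, "stuck-body");
        stuck.setDaemon(true);
        stuck.start();

        stuck.interrupt(); // a single interrupt, as before this change
        Thread.sleep(1_000);
        // Still alive: demonstrates why Timeout must keep re-interrupting.
        System.out.println("alive after one interrupt: " + stuck.isAlive());
    }
}
```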
svanoort left a comment
I don't have a strong feeling one way or the other -- it's hard to tell what the impact will be until we see it in the wild. I'm slightly nervous about using Timer threads more, given issues we've seen with the bounded threadpool, but trying harder to kill stuck threads also seems smart?
The interactions with the pipeline threading model are hard to predict with precision.
I do think we want to think hard about simplifying the pipeline threading model in the mid-term future because the chain of causality has become rather esoteric.
@jglick I have no plans to modify that in durability as of yet.
Fwiw, I think this is a problem that shows up on ci.jenkins.io pretty often, so a fix is nice to see.
Correct.
Not sure what you mean.
If you have any suggestions let me know, but note that jenkinsci/workflow-durable-task-step-plugin#21 would eliminate most of the ….
If you have the ability to track down particular instances and investigate, that would be important work. I for one have no permissions on the server.
…as a support bundle showed that all Timer threads were occupied.
Turns out it was taking longer than 10s, but no interruption was being delivered, because the other nine …
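To make that failure mode concrete, here is an illustrative Java sketch, assuming a bounded scheduled pool of ten threads (the size I believe jenkins.util.Timer uses); all other names are made up:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrates how a saturated bounded timer pool delays a scheduled interrupt. */
public class TimerStarvation {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(10);
        // Occupy all ten worker threads with long-running tasks...
        for (int i = 0; i < 10; i++) {
            timer.submit(() -> {
                try { Thread.sleep(60_000); } catch (InterruptedException x) { }
            });
        }
        Thread main = Thread.currentThread();
        // ...so this interrupt, due in 1s, has no free thread to run on:
        timer.schedule(main::interrupt, 1, TimeUnit.SECONDS);
        try {
            Thread.sleep(5_000);
            System.out.println("no interrupt delivered in 5s: pool is starved");
        } catch (InterruptedException x) {
            System.out.println("interrupted (a timer thread was free after all)");
        }
        timer.shutdownNow();
    }
}
```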
oleg-nenashev left a comment
🐝 nice catch
@reviewbybees done FWIW
BTW the actual cause of the agent-side hang in this case was believed to be JENKINS-39179.
svanoort left a comment
Ooh, I like it with the new modification. Yeah, this is some good stuff.
Saw a support bundle that showed a thread inside `FilePath.deleteRecursive`.
From that it was unclear whether this one call to `FilePath.deleteRecursive` was in fact taking more than the 10s allotted, or if there were lots of repeated calls to `run`/`check`, or what. This patch makes `Timeout` repeatedly interrupt the stuck thread, so that if the code swallows the `InterruptedException`, a new signal will be sent. I tried to also set the thread name to indicate how stalled it is, to allow this information to appear in thread dumps, but could not get it to work; whether due to a JVM bug or some oversight on my part, I am not sure.
@reviewbybees
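For readers skimming the conversation, here is a rough Java sketch of the repeated-interrupt idea described above; this is a reconstruction for illustration, not the plugin's actual code, and every name in it is invented:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Sketch: re-interrupt a stalled thread periodically, noting the stall in its name. */
public class RepeatedInterrupter {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Interrupts {@code victim} every {@code periodSeconds} until it exits,
     * renaming it so thread dumps show how long it has been stalled.
     * The caller should cancel the returned future once the victim finishes.
     */
    public ScheduledFuture<?> interruptRepeatedly(Thread victim, long periodSeconds) {
        long start = System.nanoTime();
        String baseName = victim.getName();
        return timer.scheduleWithFixedDelay(() -> {
            if (!victim.isAlive()) {
                return; // nothing left to interrupt
            }
            long stalled = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
            victim.setName(baseName + " (stalled " + stalled + "s)");
            victim.interrupt(); // resend the signal in case the last one was swallowed
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

(The renaming is included only because the description above mentions attempting it; as noted there, it did not actually work in practice.)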